1. SUMMARY


Network security is becoming increasingly challenging because of the complexity and dynamics
of network interactions. Dynamic network analysis (DNA) is an emergent scientific field in
network science, and there has been increased interest in data stream mining based DNA for
network security. The major challenge of this kind of dynamic network analysis is that it is
computationally expensive to extract knowledge structures for measuring the security levels
of large scale dynamic networks. This research aims to develop novel technical methods to
address this challenge. This report presents practical methodologies for data stream mining
based dynamic link anomaly analysis (DLAA) using novel sliding time window structures
and network analytics metrics.

DNA is fundamentally tied to spatio-temporal analysis, and the proposed DLAA is a
specialized network security research area within DNA. We employ spatio-temporal link
analysis using paired data, namely prior and current window data sets. The major purpose of
applying the sliding time window algorithm is to produce manageable and meaningful paired
data on which to perform unsupervised learning. Each of the paired data sets comprises two
types of data: spatio-temporal data and spatial data. The spatial data, which is a subset of
the spatio-temporal data, can be modeled as a graph. For
the DLAA, unsupervised learning is appropriate to learn and measure the past histories of
node-dependent, path-dependent and link appearance behaviors of dynamic networks from
the prior window data and to predict link anomalies in the current window data.

For the DLAA, we have developed a dynamic link anomaly detection (DLAD) framework
to quantify the security levels of dynamic networks. The DLAD framework consists of three
algorithmic components: the sliding time window, link scoring and link anomaly detection
algorithms. These components are operated by flow-based programming procedures.
The proposed framework was tested and evaluated using the VAST 2011 Mini Challenge
2 (MC2) datasets, including a case study. This case study was implemented with IP
graphs and IP-port graphs, and illustrates the data analysis techniques. The performance of
the proposed framework is described using statistical measures of the performance of a
binary classification test (e.g., sensitivity, specificity, accuracy, precision and error rate).
Through performance evaluation in the case study, we demonstrate that the
proposed framework has the capability to construct effective knowledge structures for
measuring the security levels of large scale dynamic networks, and to produce effective
processed stream data for generalized DNA.
2. INTRODUCTION


The  analytic  computational  complexity  of  networks  has  become  increasingly  challenging  over 
the  past  decades  due  to  the  ever  growing  size  of  networks  and  the  explosion  of  information 
exchanged  over  the  Internet.  Large  scale  networks  can  suffer  from  threats  in  many  ways.  For 
example,  network  anomalies  such  as  failure  events  of  IP  backbone,  Denial  of  Service  (DoS) 
attacks,  worm  propagation,  and  network  equipment  outages  have  distinct  characteristics  but 
are  all  potential  threats,  which  may  strongly  affect  the  functionality  and  dependability  of 
networks  in  varying  degrees. 

In network security management, network intrusion and anomaly detection [7,
13] can be roughly categorized as signature-based or data mining-based. Signature-based
schemes have the advantage of low false positive rates if the signatures are well defined in
advance. However, these schemes are not appropriate for detecting zero-day attacks. There
has been progress in using network behavior anomaly detection (NBAD) to detect or
predict unknown anomalies. As an integral part of network behavior analysis, NBAD still
needs to establish knowledge of legitimate network connectivity and user behaviors over a
certain period of time. Machine learning based anomaly detection may be promising because
predefined signatures are not required, but the challenges remaining for researchers are the
lack of recent attack data and clean (attack-free) training data, the difficulty of evaluation,
and the high cost of errors [22]. Prior anomaly detection work has looked at various aspects of
network anomalies. The focus of this work, in contrast, is on detecting link
anomalies in dynamic graphs that exhibit on/off patterns, based only on network topologies,
in the network security management domain.

Network anomalies have been manually or automatically identified in network traffic data by
applying the subspace method to multivariate time series of byte counts, packet counts,
and IP-flow counts [13]. Neighborhood formation has been studied for bipartite graphs [23].
Leman  Akoglu  et  al.  [1]  proposed  the  OddBall  algorithm  to  find  anomalous  nodes  in  a 
graph.  The  authors  use  egonets,  that  is,  the  induced  subgraph  of  an  individual  node  (ego)  of 
interest  and  its  neighbors,  and  assign  an  outlierness  score  to  an  individual  node.  Power  law 
density,  weights,  ranks  and  eigenvalues  of  the  neighborhood  subgraphs  have  proved  useful 
for  anomaly  detection.  The  graph-theoretic  approach  [11]  has  also  been  applied  to  detect 
anomalies  in  email  networks  from  the  publicly  available  Enron  email  corpus. 

In  data  mining  domains,  link  mining  [4,  8]  and  link  prediction  algorithms  [17,  18]  have 
been  introduced.  In  particular,  there  has  been  recent  progress  in  link  prediction  and  its 
applications  for  a  variety  of  domains  such  as  social  networks  [10,  15],  co-authorship  networks 
[19],  healthcare  and  bioinformatics  [2].  However,  most  works  in  link  prediction  are  only 
interested  in  predicting  whether  a  pair  of  nodes  that  are  not  connected  previously  will  ever 
be connected in the future. Although link prediction algorithms predict the future
connections in a graph, most of them use a static graph as the learning set without considering
the temporal information inherent in network traffic data. Networks often present
various dynamic behaviors [20], and traditional link prediction does not work well on datasets
with temporality. Instead, we focus on the task of detecting link anomalies [1, 11, 21, 23, 24]
and address the dynamic "on/off" behavior of link anomalies, e.g., whether and when a
previously connected link will become disconnected. The activities that we refer
to as anomalies may be related to faulty hardware/software, misconfiguration, or security
related events caused by malicious users and applications. While there have been some
efforts in anomaly-based intrusion detection [7], the detection of link anomalies has remained
challenging.

In this report, we introduce network analytics metrics and sliding time window data
structures for data stream mining in order to incorporate link anomaly detection into
DNA. Using knowledge structures from sliding time window data and network analytics
measures, we study how to measure and analyze network security on topological structures.
For better defense against zero-day attacks, we aim to utilize behavioral or learning based
approaches, such as a link prediction based approach that analyzes network
connections. The challenge is that anomaly detection has characteristics that are
fundamentally different from those of link prediction tasks [18]. Although link prediction
[10, 15] may predict that a pair of nodes not connected in the past will be connected sometime
in the future, it does not consider pairs of nodes that have previously been connected. The
fundamental deficiency of traditional link prediction is that it does not consider the
dynamics of network connections, e.g., the on/off patterns. It has been observed that networks
are becoming not only much larger but also much more heterogeneous, complex and dynamic.
In computer networks, the massive amount of traffic among the computing nodes
is constantly changing. For example, users and/or applications may establish and
drop connections at any time.

Different types of anomalies are formally modeled by
considering every possible combination of previously unlinked/linked nodes and currently
unlinked/linked nodes. In each sliding time window, each pair of nodes is assigned a link
score to measure its importance. The link scores are used to predict whether connections
will be established or disconnected in the next time phase for a variety of situations.
Based on the actual connectivity in the current snapshot graph, we are able to judge whether
each link is anomalous or normal by examining the difference between the expected result and
reality. Through a case study and performance evaluation on publicly available datasets, we
illustrate the effectiveness of the link anomaly detection algorithm by comparing
evaluation metrics.

We conclude that while this research has immediate benefits for network
security management in terms of security investigation and troubleshooting, the proposed
methodologies are general enough to have potential impact on many other types of network
applications. Time-efficient security-related investigation and effective troubleshooting can
be achieved by using the sliding time window based DLAA. For instance, network managers
and administrators are able to monitor a manageable number of the latest link anomalies for
further security investigation and threat prevention.

The main contribution of this report is the development of the dynamic link anomaly
detection (DLAD) framework. The three algorithmic components of the framework are the sliding
time window, link scoring and link anomaly detection algorithms. To keep the algorithms
generic, our approach does not need any background information such as node attributes;
it is purely based on network topologies. Intuitively, the frequency of link appearances
is a critical factor in classifying link anomalies. It is reasonable to assume that connections
occurring close to the time of investigation should receive more emphasis than connections
happening much earlier, due to the temporal locality of packet exchanges and network flows.
Hence, our approach to link analysis employs temporal dynamic, similarity and centrality
measurements on evolving network topologies in consecutive sliding time windows. These
temporal dynamic, similarity and centrality factors are combined to compute an overall
link score as a coherent link anomaly measurement methodology.


3.  METHODS,  ASSUMPTIONS,  AND  PROCEDURES 

3.1.  Network  Analytics  Metrics 

For various kinds of network analysis, including social network analysis, node-dependent and
path-dependent metrics are generally used to characterize topological structures [18]. However,
these measurements are insufficient to characterize dynamic network behaviors.
Therefore, we introduce network analytics metrics for a spatio-temporal link analysis approach
that scores the security levels of dynamic links using paired sliding time window data.

In this section, we begin by introducing the network analytics metrics using graphs. A
graph is an ordered pair $G = (V, E)$ comprising a set $V$ of vertices or nodes $u, v \in V$
together with a set $E$ of edges or links $(u, v) \in E$. Let $G = (V, E)$ be an unweighted graph
that represents the topological structure of a general connected network. For each $u \in V$,
let $\Gamma(u) = \{v : (u, v) \in E\}$ be the set of vertices connected to $u$. In addition, for each
$l \geq 1$, $\mathrm{paths}^{\langle l \rangle}_{u,v}$ is the set of paths from $u$ to $v$ consisting of $l$ edges
(paths of length $l$).


3.1.1. Node-Dependent Metric

The Jaccard index, also known as the Jaccard similarity coefficient, was originally designed
to measure the similarity between sample sets. For this report, the Jaccard index is selected
as a node-dependent metric to characterize the community similarity of nodes. It is defined as
the size of the intersection divided by the size of the union of the sets [5]. In the graph $G$, for
each pair of nodes $u, v \in V$, the Jaccard index is defined as the ratio between the number of
their common neighbors and the number of their total neighbors, namely:


$$J(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} \qquad (1)$$


where the cardinality of a set is denoted by placing absolute value signs around
the set.


In our approach, $J(u, v)$ indicates whether network nodes $u, v \in V$, such as clients, servers,
and routers, have a large percentage of overlapping destinations regardless of the absolute
number of connecting targets.
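
As a minimal sketch of Equation 1, the following Python fragment computes the Jaccard index from neighbor sets. The adjacency-dictionary representation, the function name `jaccard`, the toy node names, and the convention of returning 0 when both neighbor sets are empty are illustrative assumptions, not details taken from the framework.

```python
from typing import Dict, Hashable, Set

# Assumed representation: node -> set of neighbouring nodes.
Graph = Dict[Hashable, Set[Hashable]]


def jaccard(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Jaccard index of the neighbour sets of u and v (Equation 1)."""
    gamma_u = graph.get(u, set())
    gamma_v = graph.get(v, set())
    union = gamma_u | gamma_v
    if not union:
        # Assumption: two nodes without any neighbours get similarity 0.
        return 0.0
    return len(gamma_u & gamma_v) / len(union)


# Toy graph: a client and a server that share two of three destinations.
toy_graph: Graph = {
    "client": {"dns", "web"},
    "server": {"dns", "web", "db"},
    "dns": {"client", "server"},
    "web": {"client", "server"},
    "db": {"server"},
}

client_server_similarity = jaccard(toy_graph, "client", "server")
```

Here the client and the server share two of their three distinct neighbors, so the index is 2/3 regardless of how many connections each node has in absolute terms.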
3.1.2. Path-Dependent Metric


The Katz index is useful for measuring centrality, which determines the relative importance
of a node within the network. The Katz index is selected as a path-dependent metric in order
to characterize the centrality of the nodes in a graph. L. Katz [12] originally proposed a
path-ensemble based proximity measure. The unweighted Katz index measures the
centrality of a node using a variant of the shortest-path measure [5], in which two nodes are
considered strongly related if their connecting paths tend to be short. The Katz index is
defined as [16]


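$$K(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot \left| \mathrm{paths}^{\langle l \rangle}_{u,v} \right|, \quad \forall u, v \in V, \qquad (2)$$

where $0 < \beta < 1$ is an attenuation parameter [3] ensuring that shorter paths contribute more
to the score [16].

In our approach, $K(u, v)$ is computed to quantify the link connectivity of nodes $u$ and $v$.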


3.1.3. Link Appearance Metric

A nonlinear weight function is used for the link appearance metric, which characterizes the
temporal dynamics of connected nodes for link anomaly detection [14].

The nonlinear weight function is defined as


$$w(t) = e^{[\ldots]}, \qquad t = 1, 2, \ldots, T \qquad (3)$$

where $\lambda$ is an attenuation parameter (i.e., a positive real number), $t$ indexes the snapshot
graphs, and $T$ denotes the total number of snapshot graphs in the prior window.


The Link Appearance Index $L(u, v)$ for $(u, v) \in E$ is defined as

$$L(u, v) = \frac{\sum_{t=1}^{T} w(t) \cdot a_t}{\sum_{t=1}^{T} w(t)} \qquad (4)$$

where $a_t \in \{0, 1\}$ represents the link appearance at the $t$-th snapshot in the prior window.
That is, $a_t = 1$ if the link is present ($u$, $v$ connected) and $a_t = 0$ if the link is absent ($u$, $v$
disconnected).


Equation 4 yields a weighted mean that takes into account the fact that appearances of links
in later snapshot graphs (in other words, closer to the time of investigation) should have
higher weights than those in earlier graphs. It is reasonable to make such an assumption due to
the inherent temporal locality of network connections. Schemes such as network caching and
Cisco NetFlow also assume that if a packet is observed between a source-destination IP/port
pair, it is likely that the source will send packets again to the destination in the near future.
For temporal link analysis, higher weights are therefore assigned to later link appearances.
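
The weighted mean of Equation 4 can be sketched in Python as follows. The exponent of the weight function in Equation 3 is not legible in this copy, so the default weight used here, w(t) = exp(-lam * (T - t)), is purely an assumption chosen because it increases with t; the names `exp_weight` and `link_appearance_index` are likewise hypothetical. Any other increasing weight can be passed in through the `weight` argument.

```python
import math
from typing import Callable, Sequence


def exp_weight(t: int, T: int, lam: float = 1.0) -> float:
    """Assumed weight exp(-lam * (T - t)); NOT necessarily the report's Equation 3."""
    return math.exp(-lam * (T - t))


def link_appearance_index(
    appearances: Sequence[int],
    weight: Callable[[int, int], float] = exp_weight,
) -> float:
    """Weighted mean of the 0/1 appearances a_1..a_T of one link (Equation 4)."""
    T = len(appearances)
    if T == 0:
        raise ValueError("the prior window must contain at least one snapshot")
    weights = [weight(t, T) for t in range(1, T + 1)]
    weighted_hits = sum(w * a for w, a in zip(weights, appearances))
    return weighted_hits / sum(weights)


# A link seen only in the latest snapshot versus only in the earliest one.
recent_only = link_appearance_index([0, 0, 1])
early_only = link_appearance_index([1, 0, 0])
```

Under any increasing weight, a link that appeared only in the latest snapshot scores higher than one that appeared only in the earliest snapshot, which is the temporal locality argument made above.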

The node-dependent metric (i.e., the Jaccard index, Equation 1) and the path-dependent metric
(i.e., the Katz index, Equation 2) are used to measure node similarity and link centrality. These
metrics alone are insufficient to detect anomalous links in dynamic and complex networks.
Moreover, computational time and memory are crucial factors in implementing
real applications, and the Katz index has a limitation when applied to real-world networks:
its cubic time complexity $O(|V|^3)$, where $|V|$ is the number of vertices in $V$ [3, 9],
makes it infeasible for analyzing a large network. In order to address the
cubic complexity of the Katz measure and make the calculation more efficient for
large networks, we apply a heuristic method of partial simple paths that uses a maximum
length as a limiting factor. Paths longer than the maximum length are not
counted, partly because in most cases paths of shorter lengths dominate the
measure. In addition to these metrics, the link appearance metric is used to implement
feasible and efficient link anomaly detection mechanisms.
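
One way to read the partial simple path heuristic is as Equation 2 with the sum cut off at a maximum path length. The sketch below counts simple paths by depth-first search up to that length; the function names, the default `max_len=3`, and the adjacency-dictionary representation are assumptions made for illustration and do not reproduce the framework's implementation.

```python
from typing import Dict, Hashable, List, Set

# Assumed representation: node -> set of neighbouring nodes.
Graph = Dict[Hashable, Set[Hashable]]


def count_simple_paths(graph: Graph, u: Hashable, v: Hashable, max_len: int) -> List[int]:
    """counts[l] = number of simple paths from u to v with exactly l edges (l <= max_len)."""
    counts = [0] * (max_len + 1)
    if u == v:
        return counts
    visited = {u}

    def dfs(node: Hashable, depth: int) -> None:
        if depth == max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt == v:
                counts[depth + 1] += 1  # a simple path ends when it reaches v
            elif nxt not in visited:
                visited.add(nxt)
                dfs(nxt, depth + 1)
                visited.remove(nxt)

    dfs(u, 0)
    return counts


def truncated_katz(graph: Graph, u: Hashable, v: Hashable, beta: float, max_len: int = 3) -> float:
    """Equation 2 with the sum truncated at max_len (the limiting factor)."""
    if not 0 < beta < 1:
        raise ValueError("beta must lie strictly between 0 and 1")
    counts = count_simple_paths(graph, u, v, max_len)
    return sum(beta ** l * counts[l] for l in range(1, max_len + 1))


# Four-node ring a-b-d-c-a.
ring: Graph = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
katz_a_b = truncated_katz(ring, "a", "b", beta=0.5)
```

In the ring, a and b are joined by one path of length 1 and one of length 3, so with beta = 0.5 the truncated score is 0.5 + 0.125 = 0.625; lowering `max_len` to 1 drops the longer path and leaves 0.5, which shows how short paths dominate.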


3.2.  Dynamic  Link  Anomaly  Detection 

This section gives an overview of the dynamic link anomaly detection framework and
introduces the sliding time window, link scoring and link anomaly detection algorithms.

3.2.1. Overview of the Framework

Figure 1 shows the data flow diagram of the overall processes of the component-based
framework for link anomaly detection. The framework consists of
three algorithmic components: the sliding time window, link scoring, and link anomaly
detection algorithms. These components are illustrated in Figure 1 and their algorithms are
introduced in this section.
Figure 1. Data Flow Diagram
Algorithm 1: Sliding Time Window Algorithm($\tau$, $\omega$, $\omega'$, $D$, $i$)

Input:
    $\tau$: time zone size
    $\omega$: prior window size
    $\omega'$: current window size, where $\omega \geq \omega'$
    $D$: raw network flow data (spatio-temporal data)
    $i$: loop counter

Output:
    $F$: prior filtered data (spatio-temporal data)
    $F'$: current filtered data (spatio-temporal data)
    $G = (V, E)$: prior graph (spatial data)
    $G' = (V', E')$: current graph (spatial data)

begin
    foreach $timestamp \in ((i - 1) \times \tau, (i - 1) \times \tau + \omega]$ do
        $D_i \leftarrow D(timestamp, u, v, \text{etc.})$
        $F \leftarrow D_i(timestamp, u, v)$
        $G(u, v) \leftarrow u, v$
    end
    foreach $timestamp' \in (i \times \tau + \omega - \omega', i \times \tau + \omega]$ do
        $D'_i \leftarrow D(timestamp', u, v, \text{etc.})$
        $F' \leftarrow D'_i(timestamp', u, v)$
        $G'(u, v) \leftarrow u, v$
    end
    return $F$, $G$, $G'$
end
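
A minimal Python sketch of the window extraction in Algorithm 1 is given below. It assumes that each flow record has been reduced to a (timestamp, u, v) triple, that timestamps are expressed in the same unit as the window sizes, and that graphs can be represented by sets of directed (u, v) pairs; the names `window_bounds` and `sliding_time_window` are hypothetical. Unlike the listing, the sketch also returns the current filtered data for convenience.

```python
from typing import Iterable, List, Set, Tuple

Flow = Tuple[float, str, str]   # (timestamp, source node u, destination node v)
Edge = Tuple[str, str]
Interval = Tuple[float, float]  # (start, end], open on the left as in Algorithm 1


def window_bounds(i: int, tau: float, omega: float, omega_cur: float) -> Tuple[Interval, Interval]:
    """Prior and current window intervals for loop counter i >= 1."""
    prior = ((i - 1) * tau, (i - 1) * tau + omega)
    current = (i * tau + omega - omega_cur, i * tau + omega)
    return prior, current


def sliding_time_window(
    flows: Iterable[Flow], i: int, tau: float, omega: float, omega_cur: float
) -> Tuple[List[Flow], List[Flow], Set[Edge], Set[Edge]]:
    """Return (F, F', E, E'): filtered records and edge sets of both windows."""
    records = list(flows)
    (p_lo, p_hi), (c_lo, c_hi) = window_bounds(i, tau, omega, omega_cur)
    prior_filtered = [f for f in records if p_lo < f[0] <= p_hi]
    current_filtered = [f for f in records if c_lo < f[0] <= c_hi]
    prior_edges = {(u, v) for _, u, v in prior_filtered}
    current_edges = {(u, v) for _, u, v in current_filtered}
    return prior_filtered, current_filtered, prior_edges, current_edges


# Toy trace with timestamps in minutes since the start of capture.
trace: List[Flow] = [(1, "a", "b"), (6, "a", "c"), (14, "b", "c"), (18, "a", "b"), (25, "c", "d")]
F, F_cur, E, E_cur = sliding_time_window(trace, i=2, tau=5, omega=15, omega_cur=15)
```

With a time zone of 5 minutes and 15-minute windows, the second iteration uses the prior interval (5, 20] and the current interval (10, 25], so the two windows overlap and the current window advances by one time zone.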
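Algorithm 2: Link Scoring Algorithm($\beta$, $\gamma$, $\gamma'$, $G$, $G'$, $F$, $i$)

Input:
    $\beta$: attenuation parameter
    $\gamma$, $\gamma'$: link scoring weights
    $G = (V, E)$: prior graph
    $G' = (V', E')$: current graph
    $F$: prior filtered data
    $i$: loop counter

Output:
    $S_i$: link score table
    $S_{i,j}(u, v) = [class \;\; s_{u,v} \;\; u \;\; v]$: row vector of $S_i$

begin
    initialize $S_i$
    foreach $((u, v) \notin E) \wedge ((u, v) \in E')$ do
        $J_{u,v} \leftarrow \mathrm{Jaccard}(u, v, G)$
        $K_{u,v} \leftarrow \mathrm{Katz}(u, v, G, \beta)$
        $s_{u,v} \leftarrow (1 - \gamma) \cdot J_{u,v} + \gamma \cdot K_{u,v}$
        $S_{i,j}(u, v) \leftarrow [build \;\; s_{u,v} \;\; u \;\; v]$
    end
    foreach $((u, v) \in E) \wedge ((u, v) \in E')$ do
        $J_{u,v} \leftarrow \mathrm{Jaccard}(u, v, G)$
        $L_{u,v} \leftarrow \mathrm{LinkAppearance}(u, v, F)$
        $s_{u,v} \leftarrow (1 - \gamma') \cdot J_{u,v} + \gamma' \cdot L_{u,v}$
        $S_{i,j}(u, v) \leftarrow [keep \;\; s_{u,v} \;\; u \;\; v]$
    end
    foreach $((u, v) \in E) \wedge ((u, v) \notin E')$ do
        $J_{u,v} \leftarrow \mathrm{Jaccard}(u, v, G)$
        $L_{u,v} \leftarrow \mathrm{LinkAppearance}(u, v, F)$
        $s_{u,v} \leftarrow (1 - \gamma') \cdot J_{u,v} + \gamma' \cdot L_{u,v}$
        $S_{i,j}(u, v) \leftarrow [tear \;\; s_{u,v} \;\; u \;\; v]$
    end
    return $S_i$
end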
Algorithm 3: Link Anomaly Detection Algorithm($\Theta$, $G$, $G'$, $S_i$, $i$)

Input:
    $\Theta = \{\theta_{build}, \theta_{keep}, \theta_{tear}\}$: thresholds
    $G = (V, E)$: prior graph
    $G' = (V', E')$: current graph
    $i$: loop counter
    $S_i$: link score table
    $S_{i,j}(u, v) = [class \;\; s_{u,v} \;\; u \;\; v]$: row vector of $S_i$

Output:
    $A_i$: link anomaly table
    $A_{i,j}(u, v) = [class_{true \vee false} \;\; s_{u,v} \;\; u \;\; v]$: row vector of $A_i$

begin
    initialize $A_i$
    foreach $((u, v) \notin E) \wedge ((u, v) \in E')$ do
        if ($s_{u,v} < \theta_{build}$) then
            $A_{i,j}(u, v) \leftarrow [build_{true} \;\; s_{u,v} \;\; u \;\; v]$
        else
            $A_{i,j}(u, v) \leftarrow [build_{false} \;\; s_{u,v} \;\; u \;\; v]$
    end
    foreach $((u, v) \in E) \wedge ((u, v) \in E')$ do
        if ($s_{u,v} \leq \theta_{keep}$) then
            $A_{i,j}(u, v) \leftarrow [keep_{true} \;\; s_{u,v} \;\; u \;\; v]$
        else
            $A_{i,j}(u, v) \leftarrow [keep_{false} \;\; s_{u,v} \;\; u \;\; v]$
    end
    foreach $((u, v) \in E) \wedge ((u, v) \notin E')$ do
        if ($s_{u,v} > \theta_{tear}$) then
            $A_{i,j}(u, v) \leftarrow [tear_{true} \;\; s_{u,v} \;\; u \;\; v]$
        else
            $A_{i,j}(u, v) \leftarrow [tear_{false} \;\; s_{u,v} \;\; u \;\; v]$
    end
    return $A_i$
end
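
The threshold comparisons of Algorithm 3 can be sketched in Python as below. The row types `ScoreRow` and `AnomalyRow`, the dictionary of thresholds, and the example scores are hypothetical; in particular, the comparison used for the keep class (less than or equal) is a reading of a poorly legible operator and should be treated as an assumption.

```python
from typing import Dict, List, NamedTuple


class ScoreRow(NamedTuple):
    link_class: str  # "build", "keep" or "tear"
    score: float
    u: str
    v: str


class AnomalyRow(NamedTuple):
    link_class: str
    anomalous: bool
    score: float
    u: str
    v: str


def detect_link_anomalies(score_table: List[ScoreRow], thresholds: Dict[str, float]) -> List[AnomalyRow]:
    """Flag each scored link as anomalous or normal using its class threshold."""
    anomaly_table = []
    for row in score_table:
        theta = thresholds[row.link_class]
        if row.link_class == "tear":
            anomalous = row.score > theta    # a highly scored link was torn down
        elif row.link_class == "keep":
            anomalous = row.score <= theta   # assumed operator, see lead-in
        else:
            anomalous = row.score < theta    # a weakly scored link was built
        anomaly_table.append(AnomalyRow(row.link_class, anomalous, row.score, row.u, row.v))
    return anomaly_table


# Hypothetical scores and thresholds, for illustration only.
scores = [
    ScoreRow("build", 0.10, "a", "d"),
    ScoreRow("build", 0.70, "b", "d"),
    ScoreRow("keep", 0.80, "a", "b"),
    ScoreRow("tear", 0.90, "a", "c"),
]
theta_set = {"build": 0.3, "keep": 0.2, "tear": 0.6}
anomalies = detect_link_anomalies(scores, theta_set)
```

Each row is examined once, which is consistent with a running time linear in the number of scored links.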
The flow-based programming procedures for the link anomaly detection are as follows:

1. Start running the program.

2. Input raw network traffic data $D$, time zone size $\tau$, sliding time window sizes $\omega$ and
$\omega'$, link scoring weights $\gamma$ and $\gamma'$, an attenuation parameter $\beta$ for the Katz index,
and a link anomaly detection threshold set $\Theta = \{\theta_{build}, \theta_{keep}, \theta_{tear}\}$.

3. Send input data sets to appropriate algorithmic components.

4. Initialize loop counter ($i = 1$).

5. Run Component 1: Sliding Time Window Algorithm. The network traffic data is
parsed to get the prior window data (i.e., $D_i$, $F$, and $G$) and the current window data
(i.e., $D'_i$, $F'$, and $G'$) according to the prior and current sliding window intervals. The
prior filtered data $F$ and the prior and current graphs $G$ and $G'$ are sent to component 2
and/or component 3.

6. Run Component 2: Link Scoring Algorithm. The link score table $S_i$ is computed
using $F$, $G$ and $G'$, and the results of (i) pairwise node connectivity statuses, (ii)
node-dependent, path-dependent and link appearance measures, and (iii) combined
weighted measures $S_i$ are sent to component 3 and appended to $S$.

7. Run Component 3: Link Anomaly Detection Algorithm. The link anomaly detection
table $A_i$ is determined by $G$, $G'$, $S_i$ and the threshold set $\Theta = \{\theta_{build}, \theta_{keep}, \theta_{tear}\}$.
The link anomaly detection is based on the results of (i) pairwise node connectivity
statuses, and (ii) link score and threshold comparisons. $A_i$ is appended to $A$.

8. Increment loop counter $i$ by one.

9. Run the above three components until the end of the network traffic data is reached or a
user stops the program.


An alternative method to create $A$ is to extract $A$ from $S$ with subset pointers, using the fact
that $A$ can be a subset of $S$ ($A \subset S$) after classifying anomalies. In this case, $S$ and the
subset pointers are stored in a database for further analysis of link anomalies.


3.2.2.  Sliding  Time  Window  Algorithm 

In our proposed DLAD framework (see Figure 1), the raw network traffic data $D$ is processed
to form two types of sliding time window based processed data, namely the prior window data
and the current window data, for unsupervised learning. The sliding time window algorithm
(see Algorithm 1) processes the raw network traffic data to produce the window
data using the paired window sizes $\omega$ and $\omega'$ and the time zone size $\tau$.
The prior window data consist of the prior traffic data $D_i$, prior filtered
data $F$ and prior graph $G = (V, E)$, and the current window data consist of the current traffic
data $D'_i$, current filtered data $F'$ and current graph $G' = (V', E')$. The filtered data contain
the spatio-temporal information (i.e., nodes, edges and times), and the graphs contain the
spatial information (i.e., nodes and edges). The prior and current graphs are used to provide
the conditional expressions for scoring and detecting anomalous links. In addition to the
prior and current graphs, the prior filtered data is also used to weight the links to measure
the temporal link appearance. Algorithm 1 shows the data representations of sliding time
window data processing. With pointers, the dynamic data structures of the sliding time window
based processed data can be efficiently constructed or updated.

Table 1: Link Classes and Mathematical Notations

Link Class      Mathematical Expression                            Threshold
build           $((u, v) \notin E) \wedge ((u, v) \in E')$         $\theta_{build}$
keep            $((u, v) \in E) \wedge ((u, v) \in E')$            $\theta_{keep}$
tear            $((u, v) \in E) \wedge ((u, v) \notin E')$         $\theta_{tear}$
never linked    $((u, v) \notin E) \wedge ((u, v) \notin E')$      N/A

There are four classes of link status for the paired sliding time window based link scoring
and link anomaly detection algorithms, as follows:

1. build, $((u, v) \notin E) \wedge ((u, v) \in E')$:

A pair of nodes $u, v \in V$ was not connected in the prior graph, but a new link between
the pair is built in the current graph.

2. keep, $((u, v) \in E) \wedge ((u, v) \in E')$:

A pair of nodes $u, v \in V$ was connected in the prior graph and the link between the
pair is kept in the current graph.

3. tear, $((u, v) \in E) \wedge ((u, v) \notin E')$:

A pair of nodes $u, v \in V$ was connected in the prior graph, but the link between the
pair is torn down in the current graph.

4. never linked, $((u, v) \notin E) \wedge ((u, v) \notin E')$:

A pair of nodes $u, v \in V$ was not connected in either the prior or the current graph. The
never linked status is excluded from our proposed link anomaly detection.
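
The four link classes reduce to set operations on the prior and current edge sets. The following Python sketch assumes that each graph is represented by a set of (u, v) pairs; the function name `classify_links` is hypothetical. Never linked pairs are simply those that appear in neither set, so they are not enumerated.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def classify_links(prior_edges: Set[Edge], current_edges: Set[Edge]) -> Dict[str, Set[Edge]]:
    """Partition the observed links into the build, keep and tear classes."""
    return {
        "build": current_edges - prior_edges,  # not in E, in E'
        "keep": prior_edges & current_edges,   # in E and in E'
        "tear": prior_edges - current_edges,   # in E, not in E'
    }


E_prior = {("a", "b"), ("a", "c"), ("b", "c")}
E_current = {("a", "b"), ("b", "c"), ("c", "d")}
link_classes = classify_links(E_prior, E_current)
```

In this toy example the link (c, d) is built, (a, b) and (b, c) are kept, and (a, c) is torn down.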

For link anomaly detection, unsupervised learning is appropriate to learn the past histories
of node-dependent, path-dependent and link appearance behaviors together with the prior and
current link statuses. With the link statuses, the prior link scores are computed in Algorithm 2,
and the anomalous links are detected in Algorithm 3.


3.2.3.  Link  Scoring  Algorithm 

Component 2 in Figure 1 performs the pairwise node connectivity comparison, the network
analytics measurements and the weighted link score computation, and produces the link score
table $S_i$ (see Algorithm 2).

Two pairs of metrics are selected based on the conditional expressions of the link statuses as
follows:


• build, $((u, v) \notin E) \wedge ((u, v) \in E')$:
The node-dependent metric and the path-dependent metric are used to compute weighted
link scores. In this case, node similarity provides limited information because the paired
nodes $\{u, v\}$ are not connected in the prior graph and are connected only in
the current graph. In link anomaly detection, it is important to learn the new link
path because the attack path length is a critical factor in conventional anomaly
detection. For example, a longer path length is riskier than a shorter path length in cyber
attack data. Thus, in addition to the node similarity measure, the path-dependent
metric is applied to determine whether a new link is anomalous.

• keep, $((u, v) \in E) \wedge ((u, v) \in E')$, or tear, $((u, v) \in E) \wedge ((u, v) \notin E')$:

The node-dependent metric and the link appearance metric are used to compute weighted
link scores. A link status of keep or tear may not be stable, and the link status could
change in a short time. Thus, in addition to the node similarity metric,
the link appearance metric is applied to provide temporal information about links,
i.e., the times and frequencies of link appearances in the prior filtered data, for temporal
link analysis.
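
The metric selection described above can be sketched as a weighted combination per link class. In the Python fragment below the three metrics are passed in as callables so that the example stays self-contained; the function name `score_links`, the tuple layout of the score rows, and the constant metric values in the usage lines are hypothetical and serve only to show how the weights act.

```python
from typing import Callable, List, Set, Tuple

Edge = Tuple[str, str]
Metric = Callable[[str, str], float]
ScoreRow = Tuple[str, float, str, str]  # (class, score, u, v)


def score_links(
    prior_edges: Set[Edge],
    current_edges: Set[Edge],
    jaccard: Metric,
    katz: Metric,
    link_appearance: Metric,
    gamma: float,
    gamma_prime: float,
) -> List[ScoreRow]:
    """Build a link score table using the metric pair chosen for each class."""
    table: List[ScoreRow] = []
    # build: node-dependent and path-dependent metrics
    for u, v in sorted(current_edges - prior_edges):
        score = (1 - gamma) * jaccard(u, v) + gamma * katz(u, v)
        table.append(("build", score, u, v))
    # keep and tear: node-dependent and link appearance metrics
    for name, edges in (("keep", prior_edges & current_edges), ("tear", prior_edges - current_edges)):
        for u, v in sorted(edges):
            score = (1 - gamma_prime) * jaccard(u, v) + gamma_prime * link_appearance(u, v)
            table.append((name, score, u, v))
    return table


# Constant metrics, used only to make the arithmetic easy to follow.
score_table = score_links(
    prior_edges={("a", "b"), ("a", "c")},
    current_edges={("a", "b"), ("c", "d")},
    jaccard=lambda u, v: 0.2,
    katz=lambda u, v: 0.6,
    link_appearance=lambda u, v: 1.0,
    gamma=0.5,
    gamma_prime=0.5,
)
```

With these constants the built link scores 0.5 * 0.2 + 0.5 * 0.6 = 0.4, while the kept and torn links score 0.5 * 0.2 + 0.5 * 1.0 = 0.6.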


The computational complexity of the link scoring algorithm is polynomial time, $O(|V|^3 |E|)$.
Since it is inefficient to compute scores for all pairs of nodes in the prior graph, only
nodes that are actually connected in the current graph are used to compute the link scores,
so the metrics are evaluated $|E|$ times. The relevant methods, namely the Jaccard, Katz, and
link appearance scores, are applied to each link. Because the cubic time complexity of the Katz
measure, $O(|V|^3)$, dominates, the overall complexity is $O(|V|^3 |E|)$. If fast Katz
[6], which runs in $O(|V| + |E|)$, is used, then the computational complexity could be further
reduced to $O(|E|^2 + |V||E|)$.


3.2.4. Link Anomaly Detection Algorithm

The link anomaly detection algorithm (see Algorithm 3) produces the link anomaly table
$A_i$ using the prior and current graphs $G$ and $G'$ and the link score table $S_i$, with the
user-defined threshold set $\Theta$. Our prediction of whether a link in a network graph is normal
or anomalous is based on the conditional expressions of the link statuses in the prior and
current graphs, the prior link scores, and the thresholds. The link anomaly table $A_i$ is appended
to the link anomaly table $A$. The computational time
complexity of the link anomaly detection algorithm is $O(|E|)$.

The  link  score  table  S  and  the  link  anomaly  table  A  are  stored  in  the  network  security 
auditing  database  for  further  dynamic  network  analysis  and  evaluation  of  the  proposed  link 
scoring  and  detection  algorithms. 
4. RESULTS AND DISCUSSION


4.1. Case Study

In this section, we conduct a case study using the VAST 2011 MC2 datasets, which consist of
three days of firewall and intrusion detection system (IDS) logs as core information data from
the corporate network. These datasets are used to test and evaluate the proposed framework.
The task is to identify some interesting events in the datasets for situational awareness in the
corporate network. In this case study, the network flow data for log examination and security
is investigated to show how the proposed framework may help to discover link anomalies.

4.1.1. Data Sets and Procedures

Network  traffic  data  contain  important  information  such  as  source  (src)  and  destination  (dst) 
Internet  Protocol  (IP)  addresses,  port  numbers,  timestamps,  protocols,  etc.  We  primarily 
focus  on  link  anomaly  detection  using  IP  graphs.  In  an  IP  graph,  each  node  represents  one 
IP  address  and  each  edge  represents  a  network  connection  between  a  source  and  destination 
IP pair. We can also construct IP-port graphs by introducing the port numbers, e.g., srcIP
→ srcPort → dstPort → dstIP, resulting in heterogeneous graphs. The log files are loaded
into the dynamic link anomaly detection framework to construct two types of connectivity
graphs with either IP or IP-port nodes. The two graphs give different insights into the
networks, with a tradeoff between granularity and complexity. The proposed framework employs
flow-based programming to execute components using the paired window data sets, parameters,
and thresholds for the link anomaly detection illustrated in Section 3.2.1.
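The two graph types can be sketched as edge sets built from flow records. This is only an illustration: the record field names are hypothetical, and representing a port node as an (IP, port) pair is an assumption, since the text does not specify how port nodes are identified.

```python
def build_ip_graph(flows):
    """One node per IP address, one edge per (source IP, destination IP) pair."""
    return {(f["src_ip"], f["dst_ip"]) for f in flows}


def build_ip_port_graph(flows):
    """Heterogeneous graph following srcIP -> srcPort -> dstPort -> dstIP."""
    edges = set()
    for f in flows:
        # Assumption: a port node is identified by its (IP, port) pair.
        src_port = (f["src_ip"], f["src_port"])
        dst_port = (f["dst_ip"], f["dst_port"])
        edges.add((f["src_ip"], src_port))
        edges.add((src_port, dst_port))
        edges.add((dst_port, f["dst_ip"]))
    return edges


# Hypothetical flow records
flows = [
    {"src_ip": "192.168.2.11", "src_port": 50123,
     "dst_ip": "192.168.1.2", "dst_port": 80},
    {"src_ip": "192.168.2.11", "src_port": 50124,
     "dst_ip": "192.168.1.2", "dst_port": 80},
]
ip_graph = build_ip_graph(flows)
ip_port_graph = build_ip_port_graph(flows)
```

The two flows collapse into a single edge in the IP graph but remain distinguishable in the IP-port graph, which shows the tradeoff between granularity and graph size.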


4.1.2. Sliding Time Window Size Selection

For this case study, let the prior and current sliding time window sizes be the same, that
is, $\omega = \omega'$. Suppose that the sliding window size $\omega = \omega'$ is 15 minutes and the
time zone $\tau$ is 5 minutes (see Figure 2). With these parameters, the current window data are
captured from the raw network flow data in the current 15-minute window together with the
overlapping prior 15-minute window (see Figure 1). The sliding time window controller then
moves the current and prior windows forward 5 minutes to capture network flow data in
the next iteration. The selection of the window size $\omega$ and the time zone size $\tau$ is an
important factor in generating suitable link scores for link anomaly detection. Figures 3 and 4
compare the link appearance scores generated with a smaller window size and with
a larger window size. Figure 3 shows the scores of an IP-port graph starting at 2011-04-13
12:07:00 with a 1-minute window in the VAST 2011 MC2 datasets. The link appearance scores
are extremely low because of the lack of connectivity records, which makes it hard to
differentiate connections for dynamic link anomaly detection. Figure 4 increases the
graph size by enlarging the time window to 15 minutes. With this larger window, the
larger scores make it easier to differentiate connections and are better suited to link
anomaly detection. For other network flow data, a user should select other values of $\omega$ and
$\tau$ because link scores can vary depending on the volume of network flow data.
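The window arithmetic can be sketched as follows. The sketch assumes that the prior window is the previous position of the sliding window, that is, it starts one time zone (5 minutes) before the current window and therefore overlaps it by 10 minutes. The function name and the example time range are hypothetical.

```python
from datetime import datetime, timedelta


def window_pairs(start, end, window_min=15, step_min=5):
    """Yield ((prior_start, prior_end), (current_start, current_end)) pairs.

    Assumption: the prior window begins one step before the current window.
    """
    window = timedelta(minutes=window_min)
    step = timedelta(minutes=step_min)
    prior_start = start
    while prior_start + step + window <= end:
        current_start = prior_start + step
        yield ((prior_start, prior_start + window),
               (current_start, current_start + window))
        prior_start += step


# Hypothetical 35-minute span of flow data
pairs = list(window_pairs(datetime(2011, 4, 13, 8, 52),
                          datetime(2011, 4, 13, 9, 27)))
```

Each pair supplies the prior window used for learning and the current window used for prediction; the controller then advances both by one step.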
[Figure 2 is a screenshot of the LinkAnomaly configuration file. Legible settings: Window
Size = 15; Time Zone Size = 5; percentage of Upper Bound for ip = 0.11; percentage of Lower
Bound for ip = 0.89; percentage of Upper Bound for port = 0.11; percentage of Lower Bound
for port = 0.89; Data File Time Format = %d/%b/%Y %H:%M:%S; Evaluation File Time Format =
%d/%b/%Y %H:%M:%S.%f; Single Data File (Y or N) = Y; Single Evaluation File (Y or N) = Y.]

Figure 2. Parameter Setting
Figure 3. Link Appearance Indexes of an IP-Port Graph, $\omega$ = 1 min

Figure 4. Link Appearance Indexes of an IP-Port Graph, $\omega$ = 15 min
4.1.3. Thresholds for Cyber Attack Detection

The link score table $S$ and the link anomaly table $A$ are automatically generated. A user may
further analyze the suspicious links in a specific time range. For example, Figure 5 partially
shows a ranked link score table for day 1 (2011-4-13). Due to the space limitation
on the horizontal axis, only a few paired IP addresses and their link scores are displayed.
The node 192.168.1.14 and its associated nodes built new connections, and their link
scores are very low during port scan attacks. Figure 5 indicates that $\theta_{build} < 4.00\mathrm{E}{-05}$
is a potential threshold to predict port scan attacks or other attacks. A group of machines
with IP addresses ranging from 192.168.2.11 to 192.168.2.138 connect to three particular
machines with IP addresses 192.168.1.2, 192.168.1.6, and 192.168.1.14. Figure 6 shows this
group and the highlighted paired nodes that indicate confirmed port scan activities from the
IDS log. In addition, workstations 192.168.2.171-175 are also the sources of many port scans
to other hosts in the subnet. These link anomalies are confirmed in the IDS log as compromised
machines that are starting to conduct port scanning and Denial of Service (DoS) attacks.
Turning to the other side of the connection score distribution, Figure 7 presents the DoS
attack related datasets and $\theta_{tear} > 0.1$ as a potential threshold to predict the DoS attacks.
There was an attempted DoS attack against the corporate web server 172.20.1.15 for links
with 10.200.150.201, 206-9 on day 1 (2011-4-13). As a result of the DoS attacks, the corporate
web server 172.20.1.15 was shut down. Subsequently, all links connected to node
172.20.1.15 were torn down for a short time and yielded high link scores. While these
examples in the case study are results of cyber attacks, the proposed framework could also be
useful in detecting anomalous connections due to hardware failures or misconfigurations.
Therefore, it may also benefit other network management tasks such as troubleshooting and
diagnosing.

Figure 6. Link Score Tables

Figure 8. Multiclass Tree Representation of 4 x 4 Confusion Matrix


4.2. Performance Evaluation
4.2.1.  Link  Anomaly  Classifications 

Figure 8 depicts a multiclass tree representation of a 4 x 4 confusion matrix, a
decision-tree-like structure that illustrates the hierarchy of relationships among the 4 classes
of link status, the 8 predictive classes, and the 16 instances of actual classification results.
(Theoretically, for each link, there could be 16 classification results for 4-class link anomaly
detection.) The 4 classes of link status are build, keep, tear, and never linked, as defined
in Section 3.2.2. However, the never linked case is excluded from our current dynamic link
anomaly analysis and will instead be investigated in future research.

In  predictive  analytics,  a  confusion  matrix  is  used  to  visualize  the  performance  of  a  binary 
classifier.  For  each  class  of  a  link  status,  there  are  two  predictive  classes:  anomalous  (positive) 
or  normal  (negative).  For  each  link  anomaly  prediction,  the  prediction  can  turn  out  to  be 
either  true  or  false,  resulting  in  4  instances  such  as  true  positive,  false  positive,  true  negative 
and  false  negative.  These  instances  are  defined  (see  Figure  8)  as  follows: 


•  True Positive (TP): An anomalous link (P) is correctly identified as anomalous.

•  False Positive (FP): A normal link (N) is incorrectly identified as anomalous.

•  True Negative (TN): A normal link (N) is correctly identified as normal and rejected.

•  False Negative (FN): An anomalous link (P) is incorrectly identified as normal and rejected.
Figure 9. Hierarchical Aggregated Functions of Confusion Matrices

Table 2: Terminology and Derivations of Classification Functions from Confusion Matrix

Terminology                              Classification Function                                              Eq.
Sensitivity (True Positive Rate)         $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$                             (5)
Specificity (True Negative Rate)         $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$                             (6)
1-Specificity (False Positive Rate)      $FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$                             (7)
Precision (Positive Predictive Value)    $PPV = \frac{TP}{TP + FP}$                                            (8)
Accuracy (Accuracy Rate)                 $ACC = \frac{TP + TN}{P + N} = \frac{TP + TN}{TP + FN + TN + FP}$     (9)
Error Rate (Recognition Rate)            $ERR = \frac{FP + FN}{P + N} = \frac{FP + FN}{TP + FN + TN + FP}$     (10)



In this link anomaly research, four connection statuses (see Figure 8) are identified and three
of these connection statuses are used. These 3-class link statuses in the prior and current
graphs are used to determine link scores and detect link anomalies. For overall performance
evaluation, the hierarchical aggregated functions of confusion matrices, regardless of link
statuses, are introduced to aggregate the classification results during the iterations or after
completion of the iterations (see Figure 9). Note that
$|TP|_{ij} + |FP|_{ij} + |TN|_{ij} + |FN|_{ij} = 1$ for each link. The hierarchical aggregated
functions enable one to convert multiclass classification to binary classification. From
additional datasets such as the IDS log, we are able to evaluate the link anomaly detection
results via a confusion matrix.
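The aggregation idea can be sketched with simple counters. This is an illustration rather than the functions of Figure 9: it assumes that each link contributes exactly one outcome (TP, FP, TN, or FN), determined from the predicted label and a ground-truth label such as one derived from the IDS log. The record layout and the sample records are hypothetical.

```python
from collections import Counter


def outcome(predicted_anomalous, actually_anomalous):
    """Map one link prediction to exactly one confusion-matrix cell."""
    if predicted_anomalous:
        return "TP" if actually_anomalous else "FP"
    return "FN" if actually_anomalous else "TN"


def aggregate(records):
    """Count outcomes per link status and overall, regardless of status."""
    per_class = {}
    overall = Counter()
    for status, predicted, actual in records:
        label = outcome(predicted, actual)
        per_class.setdefault(status, Counter())[label] += 1
        overall[label] += 1
    return per_class, overall


# Hypothetical records: (link status, predicted anomalous, actually anomalous)
records = [
    ("build", True, True),
    ("build", True, False),
    ("keep", False, False),
    ("keep", False, True),
    ("tear", True, True),
    ("tear", False, False),
]
per_class, overall = aggregate(records)
```

Summing the per-status counters yields the single binary confusion matrix used for the overall evaluation, and the same counters can be accumulated across window iterations.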

Widely  recognized  classification  functions  (see  Table  2)  are  selected  to  evaluate  the  proposed 
framework.  These  functions  include  sensitivity  (true  positive  rate,  TPR  in  Equation  5), 
specificity  (true  negative  rate,  TNR  in  Equation  6),  1-specificity  (false  positive  rate,  FPR  in 
Equation  7),  precision  (positive  predictive  value,  PPV  in  Equation  8),  accuracy  (accuracy 
rate,  ACC  in  Equation  9),  and  error  rate  (recognition  rate,  ERR  in  Equation  10). 
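As a worked illustration of Equations 5 through 10, the following sketch computes the six functions from the four counts of a binary confusion matrix. The function name and the example counts are hypothetical and are not results from this study.

```python
def ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, fp, tn, fn):
    """Compute the Table 2 functions from confusion-matrix counts."""
    p = tp + fn          # all actually anomalous links
    n = tn + fp          # all actually normal links
    return {
        "TPR": ratio(tp, p),            # Equation 5, sensitivity
        "TNR": ratio(tn, n),            # Equation 6, specificity
        "FPR": ratio(fp, n),            # Equation 7, 1-specificity
        "PPV": ratio(tp, tp + fp),      # Equation 8, precision
        "ACC": ratio(tp + tn, p + n),   # Equation 9, accuracy
        "ERR": ratio(fp + fn, p + n),   # Equation 10, error rate
    }


# Hypothetical counts
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

By construction, TNR + FPR = 1 and ACC + ERR = 1, which provides a quick consistency check on any reported values.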
Figure 10. 3-Threshold Selection Procedure Using a Binary Classifier

4.2.2. Multi-Thresholds Selection

Figure 10 illustrates the threshold selection procedure. $S$ is a link score table, and
$S_{build}$, $S_{Linkclass|keep}$, and $S_{tear}$ are extracted from $S$. Each class link score
table is individually ranked. A simple strategy can be used to find
$\Theta = \{\theta_{build}, \theta_{Linkclass|keep}, \theta_{tear}\}$, and a Receiver Operating
Characteristic (ROC) curve is used to determine the best $\Theta$. The threshold selection
procedure is as follows:

1.  Let $\psi^p$ denote the $p$-th percentage of link score sets, where $p = 1, 2, \ldots, P$.

2.  Select $\Theta^p = \{\theta^p_{build}, \theta^p_{Linkclass|keep}, \theta^p_{tear}\}$ by mapping the $\psi^p$ % of $S_{build}$, $S_{Linkclass|keep}$, and $S_{tear}$.

3.  Run Algorithm 3 with $\Theta^p$.

4.  Use the aggregated functions in Figure 9 to create the $p$-th confusion matrix.

5.  Compute $TPR^p$ and $FPR^p$.

6.  Plot the ROC curve using all $TPR^p$ and $FPR^p$, and determine the best $\psi^p$.

7.  Map the best $\psi^p$ to select $\Theta = \{\theta_{build}, \theta_{Linkclass|keep}, \theta_{tear}\}$.
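The procedure above can be sketched as a small loop over candidate percentages. Several parts are assumptions made only for illustration: a simple threshold rule stands in for Algorithm 3, low scores are treated as suspicious for build and keep links and high scores for tear links, and the best operating point is chosen by maximizing TPR - FPR (Youden's J), since the selection criterion is not stated here. All names and data are hypothetical.

```python
# Assumed direction of suspicion per link class (True: high scores are suspicious)
HIGH_IS_ANOMALOUS = {"build": False, "keep": False, "tear": True}


def boundary(scores, pct, high_is_anomalous):
    """Boundary score such that about pct percent of the ranked links are flagged."""
    ranked = sorted(scores, reverse=high_is_anomalous)
    k = int(len(ranked) * pct / 100)
    return ranked[k - 1] if k > 0 else None


def is_flagged(score, theta, high_is_anomalous):
    if theta is None:
        return False
    return score >= theta if high_is_anomalous else score <= theta


def roc_point(links, pct):
    """Map pct to per-class thresholds, classify all links, return (thetas, TPR, FPR)."""
    thetas = {
        status: boundary([s for st, s, _ in links if st == status], pct, high)
        for status, high in HIGH_IS_ANOMALOUS.items()
    }
    tp = fp = tn = fn = 0
    for status, score, attack in links:
        flagged = is_flagged(score, thetas[status], HIGH_IS_ANOMALOUS[status])
        if flagged and attack:
            tp += 1
        elif flagged:
            fp += 1
        elif attack:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return thetas, tpr, fpr


def select_thresholds(links, percentages):
    """Pick the percentage whose ROC point maximizes TPR - FPR (assumed criterion)."""
    best = None
    for pct in percentages:
        thetas, tpr, fpr = roc_point(links, pct)
        if best is None or tpr - fpr > best[0]:
            best = (tpr - fpr, pct, thetas)
    return best[1], best[2]


# Hypothetical links: (status, score, confirmed attack in ground truth)
links = [
    ("build", 0.01, True), ("build", 0.02, True),
    ("build", 0.5, False), ("build", 0.6, False),
    ("keep", 0.3, False), ("keep", 0.4, False),
    ("keep", 0.5, False), ("keep", 0.6, False),
    ("tear", 0.9, True), ("tear", 0.2, False),
    ("tear", 0.1, False), ("tear", 0.05, False),
]
best_pct, best_thetas = select_thresholds(links, [25, 50, 75])
```

A different criterion, such as the ROC point closest to the top-left corner, could be substituted in `select_thresholds` without changing the rest of the loop.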

4.2.3. Performance of Classification

The ROC curve is an effective method to select the best operating point for a binary
classifier. We use the ROC curve to determine three thresholds to classify the three link
classes. Estimation of multi-thresholds for link anomaly detection is challenging because of
dynamic link behaviors. Since the VAST 2011 MC2 datasets were produced by simulating normal
activity and attacks, we apply the simple method of multi-thresholds estimation instead of
using conventional multi-thresholds estimation techniques. The proposed approach is
described in Section 4.2.2. Figure 11 presents the operating points (i.e., the percentages
$\psi^p$), and the ROC curve indicates that $\psi^p = 25\%$ is the best operating point for mapping
the thresholds $\Theta^p = \{\theta^p_{build}, \theta^p_{Linkclass|keep}, \theta^p_{tear}\}$ that can be
used as $\Theta = \{\theta_{build}, \theta_{Linkclass|keep}, \theta_{tear}\}$. For the
DLAA, $p > 5$ (i.e., $\psi^p > 25\%$) can be selected to analyze more suspicious links in the VAST
2011 MC2 datasets. This multi-thresholds selection method is inappropriate for real network
traffic data, but it is sufficient to demonstrate the effectiveness of sliding time window based
data stream mining.

Figure 11. Alternative Comparison of Various Thresholds in ROC Curve

Figure 12. Performance Evaluation Results

After  selecting  the  threshold  set  0,  we  perform  300  rounds  of  link  anomaly  detections  with 
the  sliding  time  window  starting  from  the  beginning  of  the  VAST  2011  MC2  network  flow 
data  and  moving  towards  the  end  of  the  data.  For  each  time  window,  paired  prior  and  current 
window  data  are  updated.  The  prior  window  data  are  used  for  learning  and  the  current 
window  data  are  used  for  predicting  link  anomalies.  The  IDS  log  is  used  for  verification 
purposes. 

For each round of testing, accuracy, error rate, sensitivity, and specificity are measured
using the aggregated confusion matrix for the overall performance of link anomaly predictions
(see Figure 9). These measurements are used to evaluate the proposed framework. Figure
12 shows that the experimental results yielded roughly 75% overall accuracy, indicating
the effectiveness of scoring and detecting link anomalies using the proposed framework.
The performance of the framework may degrade unless the parameters and thresholds are
adapted to real dynamic link data. However, the framework is appropriate for filtering
and analyzing anomalous links by adjusting the threshold set $\Theta$ and the parameters.


5.  CONCLUSION 

The DLAD framework of this report was developed to quantify the security levels of dynamic
networks, and it was evaluated to show the effectiveness of data stream mining for the DNA.
Through performance evaluation, we demonstrated that the proposed framework is capable of
constructing effective knowledge structures by measuring the security levels of dynamic
networks and of filtering anomalous or suspicious links from massive network traffic data. In
addition, the proposed sliding time window based method produces useful processed data
for data stream mining based generalized dynamic network analysis. This DLAA method
needs to be further developed to incorporate statistics-based data stream mining.